There have been many analysis done on movie datasets but almost none of them were focused on specific questions they tried to answer. Through our analysis we wish to answer/test 3 major questions.
They are:
Are movie production houses looking at the old patterns of movie genres and their financial success and are trying to replicate their success by making more movies of similar genre?
Is Martin Scorsese’s recent claim about the superhero movies right? If yes then what is the possible solution?
How have the most bankable directors of the older generation been performing in the past 2 decades? Who among them is the most versatile and most profitable?
Context for question 2: Martin Scorsese recently claimed that the superhero movies have been taking over the theaters and are overshadowing other good movies that released during the same time. He said he feels that they are like these amuzement parks and not like the cinema he grew up loving. This opinion caused a lot of raised eyebrows.
Conext for question 3: The bankability of the directors (in our analysis) is measured by the revenues their movies generated. The higher the revenue, higher the bankability.
This project uses two different datasets (the normal dataset, and the superhero movie dataset), both taken from different sources.
The normal dataset was created by scraping the movie data directly from TMDB : The Movie DataBase. We could have gone with the IMDB dataset but the issue with it was that it had the genres of a movie ordered alphabetically, TMDB had them ordered according to the relevance to the movie. TMDB is an online database containing data on tens of thousands of movies released over the past 50 years. It is a useful source of movie data for developers since it has a pre-built API which allows anyone to scrape data on any movie in the database, provided they create an account and receive an API key. We have used the API to collect the top 100 highest grossing movies of each year, for the years from 1980 to 2019. This scraping process was done in Python see new_dataloader.py file on GitHub through this we collected the ‘id’, ‘original_title’, ‘popularity’, ‘budget’, ‘revenue’, ‘genres’, ‘vote_count’, ‘vote_average’, ‘runtime’, ‘release_date’ and ‘director’ values for 3959 different movies (Though we were supposed to get 4000 movies, the API coudn’t extract some movies). It is worth noting that TMDB has its own popularity metric as well as a voting system to evaluate whether the public enjoyed each of the movies. The vote count and the vote average are the number of votes and the average of the vote values the movie received. These votes are given by the users of TMDB API.
The second dataset contains data on movies based solely on comics superhero characters, and was hand-built by scraping data from Wikipedia, IMDB, Rotten Tomatoes and Box Office Mojo. We decided to focus on superhero movies only containing characters from Marvel and DC comics, as these studios have produced the overwhelming majority of superhero movies in the past two decades. As was the case for the main dataset, the scraping was done in Python see superhero_scrape.py file on GitHub.
The first step to creating the superhero movie dataset was identifying the titles of all movies based on Marvel or DC characters. This was done by scraping the table of movies from the “List_of_films_based_on_Marvel_Comics_publications” and “List_of_films_based_on_DC_Comics_publications” Wikipedia pages - using the BeautifulSoup Python library See ty.py on GitHub. Once the names of these movies were identified, we scraped the IMDB, Box Office Mojo and Rotten Tomatoes websites of each movie page. We were able to find all IMDB pages automatically, but for some 30 Rotten Tomatoes pages, and 40 Box Office Mojo pages the automatic linking did not work. For those movies, we manually searched for copied the links to each page in the Python script.
Once all links were copied into the script, we scraped each page to obtain the relevant information. Since no single website contained all the information we were after, we had to scrape different types of data from different websites (IMDB, Rotten Tomatoes or Box Office Mojo pages). From IMDB, we downloaded the IMDB score, release date, budget, opening weekend gross, total domestic gross and worldwide gross. From Rotten Tomatoes we downloaded the critic score and the audience score. It is worth noting that Rotten Tomatoes was the only website to offer a critic score, and not just a score that any user can contribute to. Finally, from Box Office Mojo we downloaded the number of theaters that each superhero movie was displayed in.
The final “clean_superhero.csv” dataset contains the title, IMDB link, Rotten Tomatoes link, Box Office Mojo link, IMDB rating, release date, budget, Rotten Tomatoes critic score, Rotten Tomatoes audience score, number of theaters displaying the movie, comic book studio, domestic opening weekend gross, total domestic gross and worldwide gross for 89 superhero movies released between 1950 and 2019.
In two instances, we used external datasets to provide information on average movie ticket prices over time (to check the effect of inflation on our data) and the total number of movie theaters in the US over time (to check the ratio of theaters in the US that displayed a specific movie).
Both datasets were taken from the National Association of Theatre Owner, a private company headquartered in Washington, D.C.
NATO represents some 33,000 movie screens in all 50 states. Cinema chains subscribe to NATO membership and are required to provide information on the ticket prices for the movies they display as well as specific information on individual movie theaters.
For movie ticket price and number of theaters, the data comes directly from the companies which own movie theatres in the US.
The TMDB data on the other hand is completely user contributed. TMDB is a free and open movie database, meaning any user can submit an incorrect data issue, or click “Edit” on any movie to access the editing interface.
Content issue reports are mostly handled by content moderators who are volunteers. Moderators then work on locking and unlocking data, deleting entries, seasons/episodes, images and keywords, resetting URLs and primary posters.
The major reason we selected TMDB database instead of IMDB is because IMDB database has the genres of the movies listed in alphabetical order, but TMDB listed them in the order of relevance to the movie.
Some part of the data cleaning was performed in python while scraping itself. For example, in the superhero movie dataset, we removed the “$” signs from the revenue and budget columns, converted numbers in “900,000,000” format to “900000000”, and added in a column containing only the release year of each superhero movie, to avoid having to extract the year from the release date in R.
In R we extracted the release year from the release date and converted it into numeric data to facilitate our analysis. We sometimes had to remove the “,” comma signs from the revenue data. Once this is done we converted the revenue data to numeric.
Some NA values in this dataset have been imputed to 0s in certain columns and None in other columns, we make them back into NAs. To be specific some NAs in the budget, and revenue were imputed to 0 and the NAs in the runtime were imputed to ‘None’. Since we know that the budget, revenue and runtime certainly cannot be equal to zero or None, we can safely conclude that they have been imputed to 0s and make them back to NA again for our analysis. The genres and director columns also have NA values, which are present as empty strings, we are making them into NAs as well.
Now we will see the percentage of NA values in each column of the dataset.
dfg=read.csv('dataset1_4000.csv',sep=";")
dfg$budget <- replace(dfg$budget, dfg$budget == 0, NA)
dfg$revenue <- replace(dfg$revenue, dfg$revenue == 0, NA)
dfg$runtime <- replace(dfg$runtime, dfg$runtime == 'None', NA)
dfg$genres=replace(dfg$genres, dfg$genres == '', NA)
pr1=colSums(is.na(dfg))/nrow(dfg)*100
sort(pr1,decreasing = TRUE)
## budget revenue runtime genres director
## 18.6158121 1.1619096 0.9598383 0.6819904 0.2525890
## id original_title popularity vote_count vote_average
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## release_date
## 0.0000000
We can see that the budget column has the highest missing values, followed by revenue, runtime, genres, and director. No other columns have any missing values.
Let us now try to visualize the missing patterns:
library(extracat)
nm1=dfg
colnames(nm1)=c("director","id","title","popularity","budget","revenue","genres","VC","VA","runtime","RD")
visna(nm1, sort='b')
Note: The names of some columns have been re-coded in the interest of space. (“title”=“original title”,“VC”=“Vote count”,“VA”=“Vote Average”, “RD”=“release date”).
We see something interesting here, even though from the plot it looks like many rows are missing runtime, genre, and the director details, we found on careful examination of the dataset that the missing values of the director column, genre column and the runtime column have a higher number of unique missing patterns than the other columns. Since visna compiles all the rows with similar missing patterns into one single row, and most of the missing patterns of director, genre and the runtime are unique, it looks like they appear more frequently when we look at the blue fills in their columns.
From the missing row patterns we can see that, of all the rows that are having missing values, most rows are missing budget details. The highest missing values are in the budget column, and then revenue and then runtime, then there are a few in genre and a few in director.
We can see that whenever revenue of a movie is missing, budget is also missing.
Let us now add an extra column to our dataset with its values as number of NAs in the row (i.e. number of missing value per movie). Let us also check our assumption that movies of bad quality (low total ratings (total rating values are obtained by multiplying no of votes with vote average of each movie)), or the movies that are less popular or the movies that are old have more NA values. We draw a scatter plot matrix for ratings score, year of release, popularity and no of NA values in the row to check if a strong correlation exists.
#Extracting the year of release
dfgg= dfg %>% mutate(year=as.numeric(substring(release_date,1,4)))
#Summing up all the NAs in a row
v=as.data.frame(rowSums(is.na(dfgg)))
dfg1=cbind(dfgg,v)
dfg1 <- dplyr::rename(dfg1, na_sum = `rowSums(is.na(dfgg))`)
dfg1=mutate(dfg1, tot_rating=vote_count*vote_average)
dfg2 = select(dfg1,popularity,year,tot_rating,na_sum)
#Selecting only the rows that have NA values
dfg3=filter(dfg2,dfg2$na_sum!=0)
ggpairs(dfg3,progress=FALSE)
Note: tot_rating and na_sum from the plots represent the total rating and the number of NA values for that movie.
Though there is a slight negative correlation between the popularity and number of NA values, release year and number of NA values, and total rating and the number of NA values; it is not strong enough to strengthen our assumption. But from the plots we can see that higher no of missing values (more than 2 missing values) in a row (per movie) are fairly concentrated for the movies that were released before 1990. It is the same with popularity and total rating too, i.e. there are more number of NAs for movies with popularity below 40, and we can also see that no movie with a popularity score over 50 actually has any NA value. There is only one exception for this which is the movie Frozen II which has a popularity score above 400 and is missing the budget details. Most of the movies that are having missing values are missing budget value irrespective of their popularity, year of release or total rating.
Let us now remove the rows containing NA values from our dataset for our analysis from here on. We will save this as our clean dataset, and since we have extracted the years of the movies from the release date (we did it while making the scatter plot matrix), this will have years column seperately in numeric format. Removing all the NA values removed 747 movies from the original dataset (which had 3959 movies) now we are left with 3212 movies.
new_df=na.omit(dfgg)
#We lost 747 movies after removing the NA values from the dataset
write.csv(new_df,"/Users/ap/Documents/EDAV_assignments/clean_data.csv", row.names = FALSE)
Let us repeat the same procedure for the superhero dataset, and see if there are any missingness patterns in this dataset. First let us see the percentage of missing values in each column of the dataset.
df=read.csv("clean_superhero.csv")
pr=colSums(is.na(df))/nrow(df)*100
sort(pr,decreasing = TRUE)
## opening_weekend_usa theaters gross_usa
## 6.896552 5.747126 5.747126
## gross_worldwide budget name
## 5.747126 2.298851 0.000000
## imdb rt bm
## 0.000000 0.000000 0.000000
## imdb_rating release_date critic_score_rt
## 0.000000 0.000000 0.000000
## audience_score_rt studio year
## 0.000000 0.000000 0.000000
We can see from the above output that we don’t have a lot of missing values in our dataset, and of the missing values we have most of them are missing opening weekend collection details. The “theaters” column has the number of theaters that were allocated to the movie. gross_usa and the gross_worldwide columns show the gross amount made by the movie in the US and worldwide. The budget column gives the budget of the movie. There are some missing values in the theaters column, and some in the gross_usa column and some in the gross_worldwide column and a very few in the budget column.
library(extracat)
nm=df
colnames(nm)=c("name","imdb","rt","bm","imdbr","RD","budget","CS","AS","theaters","studio","year","opw","Gusa","Gwrld")
visna(nm, sort='b')
Note: The names of some variables have been re-coded in the interest of space. (“opw”=“opening weekend gross”,“Gusa”=“overall domestic gross,”Gwrld“=”Worldwide gross“,”rt“=”rotten tomato link“,”imdbr“=”imdb rating“,”RD“=”release date“,”CS“=”Critics Score“,”AS“=”Audience Score").
We can see from the plot that most of the rows have all the values. In the rows having missing values, the most common missing pattern is the worldwide gross, US gross, number of theaters and opening weekend. We can see that whenever budget is missing worldwide gross, USA gross, theater allocation, and opening weekend collection details are missing.
Whenever the domestic gross is missing, the worldwide gross and the opening weekend collection are missing (which makes sense as without data on opening weekend gross data we can’t calculate the domestic gross and without domestic gross we can’t calculate the worldwide gross). Whenever the theater allocation details are missing the opening weekend data is also missing.
#Extracting release years from the release dates of the movies
dff1=df %>% mutate(year = substr(release_date, nchar(as.character(release_date))-10+1, nchar(as.character(release_date))))
dff1 = dff1 %>% mutate(year_of_release = substr(year, 1,4))
dff1$year_of_release=as.numeric(dff1$year_of_release)
Let us now see if there is any correlation between the release year, imdb score and the number of NAs in the row of the movie.
library(GGally)
dff1=dff1 %>% mutate(NAs=rowSums(is.na(dff1)))
dff2 = select(dff1,year_of_release,imdb_rating,NAs)
#taking only the movies that have NA values in their rows
dff3=filter(dff2,dff2$NAs!=0)
ggpairs(dff3)
We can see that there is a strong negative correlation between the year of release and the number of NA values for the movies, that have missing values. It strengthens our suspicion that the missing budget, revenue, theater data values are because of the lack of records for the older superhero movies.
dff <- read.csv('clean_data.csv')
The dataset we are woking with is composed of 3212 rows and 12 columns, in this part of our analysis we are primarily focusing on the the “genre”, “vote average”, “budget”, “revenue” and “runtime” columns.
Note: The genre details of the movies in the datadet are given as “Action|Comedy|Adventure” etc, for our analysis we are considering the first genre (let us call this primary genre from here on) in the list, as it is the primary genre of the movie (The genres in the genre column of the dataset are present in the order of their relevance to the movie(ex: if a movie have 2 genres action and comedy, and if action is its primary genre, it is represented as “Action|Comedy”) and are not ordered alphabetically).
Let us now see the list of all the unique primary genres in our normal dataset.
#Only keep the first genre, which is the most relevant
genres1 <- do.call(rbind,strsplit(as.character(dff$genres),'\\|'))[,1]
df1 <- data.frame(genres = genres1, select(dff, -genres))
unique(as.character(df1$genres))
## [1] "Adventure" "Science Fiction" "Music"
## [4] "Comedy" "Drama" "Action"
## [7] "Horror" "Romance" "Crime"
## [10] "Mystery" "Fantasy" "War"
## [13] "Western" "Family" "Animation"
## [16] "Thriller" "Documentary" "History"
## [19] "TV Movie"
As we can see there are 19 unique primary genres in our dataset. Now let us see how many movies of each these genres are present in the dataset. This will help get an idea of the distribution of primary genres in our normal dataset.
#plot (number of movies for each genre)
genres_info <- df1 %>%
count(genres) %>%
mutate(perc = (n / nrow(df1))*100)
ggplot(genres_info, aes(x=reorder(genres,n), y=perc)) +
geom_bar(stat = 'identity') +
xlab('Genre') +
ylab('Percentage of movies') +
ggtitle('Percentage of movies in our dataset of each unique primary genre') +
coord_flip()
We can see that the most number of movies in our dataset have their primary genre as action. We will now take the top most present primary genres in our dataset for further analysis.
In our dataset, the 5 most present genres are: * Action (~20.7%) * Comedy (~19.8%) * Drama (~17.9%%) * Adventure (~9.71%%) * Horror (~6.63%%) We are going to focus our study on these genres, which represent ~75% of the whole dataset.
Let us first look at how the average budget of movies made in each of these 5 genres has been changing over the years.
Note: When hovering over the line we can see the budget, year, and the genre. Clicking on the legend will remove the line of that genre and will provide for a better view of other line plots. If you want to separately see the trend in the budget of each genre please turn off the line of the other genres by clicking on their legend.
From the overall trend we can see that the average budget allocated for movies of each genre has been increasing (this might also be happening because of inflation). The average budget for “Adventure” movies has increased at a faster pace then the average budget for the movies of other genres we are considering.
Let us now see if the same order is maintained in the average revenue generated by movies made in each of these 5 genres.
resultss$genres <- factor(resultss$genres, levels = c('Adventure', 'Action', 'Horror', 'Comedy','Drama'))
o1=ggplot(data = resultss) +
geom_line(aes(x = year, y = revenue, color=genres)) +
scale_color_colorblind() +
ggtitle('Evolution of average revenue over the years') + labs(x = "year", y = "Revenue (in millions of dollars)")
ggplotly(o1) %>% config(displayModeBar = F)
From the graph we can see that, the average revenue generated by the genres also has been increasing over the years and the ordering of the genres is more or less the same. The movies of “Adventure” genre have generated more revenue than other genres in most of the years, except the years 1982, 1988, 1994, 2009 and 2015.
Let us now look at the average Return on Invesment for each of the 5 genres we are considering. Return on investment is calculated by subtracting the budget from revenue and dividing this difference by the budget. This will signify the profit each movie made for one dollar of its investment. We then group the movies by their primary genres and see which genre gives high return on investment.
#x axis is the genre, bar graph
df3=read.csv("main_ds_without_superhero.csv")
#Removing one movie whose budget is misrepresented as 2 dollars
df3<-df3[!(df3$budget==2 ),]
#Removing movie with wrong details
df3<-df3[!(df3$id==506972 ),]
df3<-df3[!(df3$id==506664 ),]
df3<-df3[!(df3$id==503314 ),]
genres1 <- do.call(rbind,strsplit(as.character(df3$genres),'\\|'))[,1]
df4 <- data.frame(genres = genres1, select(df3, -genres))
df4 <- select(df4,genres,revenue,budget,original_title,release_date)
df4 <- filter(df4, genres %in% as.vector(genres_under_study))
df4 <- dplyr::mutate(df4,ROI = (revenue-budget)/budget)
df4 %>%
dplyr::group_by(genres) %>%
dplyr::summarise(average_ROI = mean(ROI)) -> resultss2
ggplot(resultss2) +
geom_bar(aes(x=reorder(genres,-average_ROI), y=average_ROI), stat='identity') +
xlab('Genres') +
ylab('Average Return on Investment')+
ggtitle('Average Return on Investment of each of the 5 genres we are considering')
Note: ROI doesn’t have a unit of measurement, in general the higher the ROI, the more value we receive for investment.
After removing the suspicious movies that did not have consistent budget and revenue data, we end up with this graph. We can see that horror movies have a very high average return on investment. This is because they are not expensive to make, and if they are good they generate a huge revenue. After filtering and checking the data, we found that the movie with the highest ROI is “Paranormal Activity 2009” which had a budget of 15,000 dollars and made 193 million dollars globally. The second highest is “The Blair Witch Project” which is the first ever found footage genre movie. It was made on a budget of 60,000 dollars and generated a revenue of 248 million dollars. These movies increased the average ROI of the Horror genre. Even without these movies “Horror” movies have a higher ROI than other genres. When observing the comedy genre, we found that a Japanese zombie comedy movie called “One Cut of the Dead” was made on a budget of 30,000 dollars and generated a revenue of 200 million dollars after its international release. This movie alone is one of the reaons for the high average ROI of comedy movies. Of the movies belonging to the Drama genre, the movie with highest ROI is a Hindi movie “Secret Superstar” it has an ROI of 479.
Let us now look at the average runtime of the movies of each primary genre we have chosen.
#Reference: https://edav.info/cleveland.html
theme_dotplot <- theme_bw(14) +
theme(axis.text.y = element_text(size = rel(.75)),
axis.ticks.y = element_blank(),
axis.title.x = element_text(size = rel(.75)),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(size = 0.5),
panel.grid.minor.x = element_blank())
xd = data.frame(df2 %>% group_by(genres) %>% summarise(avg_rt=mean(runtime)))
# create the plot
ggplot(xd, aes(x = avg_rt, y = reorder(genres, avg_rt))) +
geom_point(color = "blue", size=3) +
scale_x_continuous(limits = c(90, 125),
breaks = seq(90, 125, 10)) +
theme_dotplot +
xlab("Average runtime in minutes") +
ylab("Genres") +
ggtitle("Average runtime by primary genre") +
theme(axis.title.x = element_text(size=15)) +
theme(axis.title.y = element_text(size=15)) +
theme(axis.text.x = element_text(size=12)) +
theme(axis.text.y = element_text(size=12))
From this we can see that Horror movies on average have the smallest runtime of the genres we have chosen. Action and Adventure movies have almost the same average runtime and Dramas have the longest runtime of them all.
Let us now see how many of the top 100 highest grossing movies of each year (our dataset has top 100 highest grossing movies of each year from 1980 to 2019, we have lost some movies after removing NA values) have one of the 5 genres we’re considering (Action, Adventure, Comedy, Drama, Horror) as main genre (Since we have extracted only the main genre from the genre column of each movie). Since we have the data from 1980 to 2019, we would have to draw one bar graph per year to show this, which is unnecessary. Let us instead visualize these graphs in animation:
#reference https://towardsdatascience.com/create-animated-bar-charts-using-r-31d09e5841da
cv= data.frame(df2 %>% dplyr::group_by(year,genres) %>% dplyr::summarise(cnt = n()) %>% ungroup())
#giving ranking to cv for the order maintenance
cv_formatted <- cv %>%
dplyr::group_by(year) %>%
#breaks ties randomly
dplyr::mutate(rank = rank(-cnt,ties.method = "random"),
cnt_lbl = paste(" ",cnt)) %>% dplyr::ungroup()
#trial=filter(cv_formatted, year<1983)
sp = ggplot(cv_formatted, aes(rank, group = genres,
fill = as.factor(genres), color = as.factor(genres))) + scale_colour_colorblind() + scale_fill_colorblind()+
geom_tile(aes(y = cnt/2,
height = cnt,
width = 0.8), alpha = 0.9, color = NA) +
geom_text(aes(y = 0, label = paste(genres, " ")), vjust = 0.2, hjust = 1, size=7) +
geom_text(aes(y=cnt,label = cnt_lbl, hjust=0),size=7) +
coord_flip(clip = "off", expand = FALSE) +
scale_y_continuous(labels = scales::comma) +
scale_x_reverse() +
guides(color = FALSE, fill = FALSE) +
theme(axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
legend.position="none",
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
panel.grid.major.x = element_line( size=.1, color="grey" ),
panel.grid.minor.x = element_line( size=.1, color="grey" ),
plot.title=element_text(size=23, hjust=0.55, face="bold", colour="grey", vjust=-1),
plot.background=element_blank(),
plot.margin = margin(2,2, 2, 4, "cm"))
anim = sp + transition_states(year, transition_length = 3, state_length = 2) +
view_follow(fixed_x = TRUE) +
labs(title = 'No of movies of our 5 primary genres made in Year: {closest_state}')
animate(anim, duration=70, fps = 20, width = 1500, height = 1000,
renderer = gifski_renderer())